Which companies use R?


[company logo] For behavior analysis related to status updates and profile pictures.
[company logo] For advertising effectiveness and economic forecasting.
[company logo] Acquired the company behind Revolution R and uses it for a variety of purposes.
[company logo] For data visualization and semantic clustering.
[company logo] For statistical analysis.
[company logo] To scale data science.
[company logo] For data curation, analysis, and visualisation.
And many more…

Why use R? Python can do all of this too

Think of R like a cat!
And Python like a dog!
Both are great pets to have. Some people prefer one over the other, but at the end of the day both are amazing.
The problem starts when someone looks at R and expects it to be a dog:
“Your dog is broken!”
R has some strange parts, but it compensates with parts that are not just good, but great.
Some parts of R are better than Python, and some parts of Python are better than R.

Acquiring the data

Data can be loaded into R from many sources. R supports file formats such as csv, xlsx, SPSS, and SAS, as well as databases such as MySQL, SQLite, PostgreSQL, and MonetDB.

The most common approaches are reading from a csv, xlsx, or txt file, or connecting to a MySQL or SQLite database.
[package logo] Used for reading rectangular data such as “csv”, “tsv”, and “fwf” into R.
[package logo] Used to import Excel files into R.
[package logo] R interface to Apache Spark for working with big data.
[package logo] Manage Google Drive files from R.
[package logo] Interact with Google Sheets from R.
[package logo] Wraps the ‘xml2’ and ‘httr’ packages to make it easy to download and manipulate HTML and XML.

Reading local data

We can read a .csv file using the base read.csv() function or the read_csv() function from the readr package.

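# Note: read.csv() treats the first line as a header by default; if the file has
# no header line, pass header = FALSE so the first record is not used as column names.
data <- read.csv("datasets/adult_data.csv")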
names(data) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "gender", "capital_gain", "capital_loss", "hours_per_week", "native_country", "predictive_variable")
head(data)
##   age         workclass fnlwgt  education education_num
## 1  50  Self-emp-not-inc  83311  Bachelors            13
## 2  38           Private 215646    HS-grad             9
## 3  53           Private 234721       11th             7
## 4  28           Private 338409  Bachelors            13
## 5  37           Private 284582    Masters            14
## 6  49           Private 160187        9th             5
##           marital_status         occupation   relationship   race  gender
## 1     Married-civ-spouse    Exec-managerial        Husband  White    Male
## 2               Divorced  Handlers-cleaners  Not-in-family  White    Male
## 3     Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
## 4     Married-civ-spouse     Prof-specialty           Wife  Black  Female
## 5     Married-civ-spouse    Exec-managerial           Wife  White  Female
## 6  Married-spouse-absent      Other-service  Not-in-family  Black  Female
##   capital_gain capital_loss hours_per_week native_country
## 1            0            0             13  United-States
## 2            0            0             40  United-States
## 3            0            0             40  United-States
## 4            0            0             40           Cuba
## 5            0            0             40  United-States
## 6            0            0             16        Jamaica
##   predictive_variable
## 1               <=50K
## 2               <=50K
## 3               <=50K
## 4               <=50K
## 5               <=50K
## 6               <=50K
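
If a CSV file has no header line, read.csv() with its default header = TRUE uses the first record as column names, and that record is lost. Character fields may also carry leading spaces after each comma. The following is a minimal sketch of a more defensive read using base R only. It writes a small temporary file with made-up rows so it can run on its own; the four column names are assumptions for illustration, not the full schema of datasets/adult_data.csv.

```r
# Write a tiny header-less CSV to a temporary file (made-up rows for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c("39, State-gov, Bachelors, <=50K",
             "50, Self-emp-not-inc, Bachelors, <=50K",
             "38, Private, HS-grad, >50K"), tmp)

toy <- read.csv(
  tmp,
  header = FALSE,                 # keep the first record as data
  col.names = c("age", "workclass", "education", "income"),
  strip.white = TRUE,             # drop spaces around unquoted fields
  stringsAsFactors = FALSE        # keep text columns as character
)

nrow(toy)
unique(toy$income)
```

Setting the column names at read time also removes the need for a separate names(data) <- ... step.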

To read data from a database such as SQLite, we first need to establish a connection to it.

library(DBI)  # provides dbConnect(), dbListTables(), dbReadTable(), dbDisconnect()
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

Then we can use this connection object to read from and modify the database.

dbListTables(con)
## [1] "iris"   "mtcars"
mtcarsData <- dbReadTable(con, "mtcars")
str(mtcarsData)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
dbDisconnect(con)
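
dbListTables() above reports iris and mtcars tables, but a fresh ":memory:" database starts out empty, so the tables were presumably written in a step that is not shown. A minimal sketch of one way that could be done, assuming the DBI and RSQLite packages are installed and using R's built-in mtcars and iris data frames. The SQL query is only an illustration.

```r
library(DBI)

# Open a throwaway in-memory SQLite database (it starts with no tables)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Copy two built-in data frames into the database as tables
dbWriteTable(con, "mtcars", mtcars)
dbWriteTable(con, "iris", iris)

tables <- dbListTables(con)

# Run SQL directly instead of reading a whole table into R
heavy_cars <- dbGetQuery(con, "SELECT mpg, wt FROM mtcars WHERE wt > 5")

# Close the connection when finished
dbDisconnect(con)
```

dbGetQuery() is useful when the table is large and only a filtered subset is needed in memory.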

Cleaning the data

In the real world the data is not always “clean”. There are many ways to define clean. Common things to look out for:


[package logo] dplyr is one of the most widely used packages for data wrangling in R.
[package logo] Also a very popular package for data wrangling.
[package logo] Used for string manipulation.
[package logo] Used to work with dates.
[package logo] Used to work with times.
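
As a rough illustration of what cleaning can involve, here is a small base R sketch on a made-up data frame. The issues it handles (stray whitespace, a "?" placeholder for missing values, and numbers stored as text) are assumptions about what typically needs attention, and the "n/a" entry is hypothetical.

```r
# Made-up rows that mimic some common problems
toy <- data.frame(
  occupation     = c(" Sales", " ?", " Tech-support"),
  hours_per_week = c("40", "16", "n/a"),
  stringsAsFactors = FALSE
)

# 1. Remove leading and trailing whitespace
toy$occupation <- trimws(toy$occupation)

# 2. Turn placeholder values into proper missing values
toy$occupation[toy$occupation == "?"] <- NA

# 3. Convert text to numbers; entries that cannot be parsed become NA
toy$hours_per_week <- suppressWarnings(as.integer(toy$hours_per_week))

# Count the missing values per column
missing_per_column <- colSums(is.na(toy))
missing_per_column
```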

Understanding the data

Understand every column of the data first

Numerical data fields

Age

library(ggplot2)
ggplot(data, aes(x = age)) + geom_bar()

Hours worked per week

ggplot(data, aes(x = hours_per_week)) + geom_histogram(binwidth = 10)

Categorical data fields

Marital Status

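library(dplyr)   # group_by(), summarise(), %>%
library(plotly)  # ggplotly()
# Count the rows in each marital status
data_marital_status <- data %>% group_by(marital_status) %>% summarise(count = n())
# Interactive horizontal bar chart, ordered by count
ggplotly(ggplot(data_marital_status, aes(x = reorder(marital_status, count), y = count)) + geom_col() + coord_flip())
# Pie chart of the same counts
ggplot(data_marital_status, aes(x = "", y = count, fill = reorder(marital_status, -count))) +
    geom_bar(width = 1, stat = "identity") +
    coord_polar("y", start = 0)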

Education

ggplotly(ggplot(data %>% group_by(education) %>% summarise(count = n()), aes(x = reorder(education, count), y = count)) + geom_col() + coord_flip())

Occupation

ggplotly(ggplot(data %>% group_by(occupation) %>% summarise(count = n()), aes(x = reorder(occupation, count), y = count)) + geom_col() + coord_flip())

Relationship

ggplotly(ggplot(data %>% group_by(relationship) %>% summarise(count = n()), aes(x = reorder(relationship, count), y = count)) + geom_col() + coord_flip())

Race

ggplotly(ggplot(data %>% group_by(race) %>% summarise(count = n()), aes(x = reorder(race, count), y = count)) + geom_col() + coord_flip())

Gender

ggplotly(ggplot(data %>% group_by(gender) %>% summarise(count = n()), aes(x = reorder(gender, count), y = count)) + geom_col() + coord_flip())

Native Country

ggplotly(ggplot(data %>% group_by(native_country) %>% summarise(count = n()), aes(x = reorder(native_country, count), y = count)) + geom_col() + coord_flip())
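
The categorical plots above all repeat the same pattern: count rows per level, then draw an ordered horizontal bar chart. One way to avoid the repetition is a pair of small helpers. This is only a sketch, assuming dplyr and ggplot2 are installed. The names count_by and plot_counts are hypothetical, and the toy data frame stands in for the real data.

```r
library(dplyr)
library(ggplot2)

# Count rows per level of one column; `column` is the column name as a string
count_by <- function(df, column) {
  df %>%
    group_by(category = .data[[column]]) %>%
    summarise(count = n())
}

# Horizontal bar chart of those counts, ordered by count
plot_counts <- function(df, column) {
  ggplot(count_by(df, column), aes(x = reorder(category, count), y = count)) +
    geom_col() +
    coord_flip() +
    labs(x = column)
}

# Made-up example data
toy <- data.frame(race = c("A", "B", "A", "A", "C", "B"))
counts <- count_by(toy, "race")
p <- plot_counts(toy, "race")  # print(p) draws it; ggplotly(p) would make it interactive
```

With the real data, a call such as plot_counts(data, "education") would replace each of the long one-liners.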

Now try to form hypotheses and test them

Hypothesis 1

People who study more make more money

data$is_rich <- if_else(data$predictive_variable == "<=50K", 0, 1)
education_summary <- data %>% group_by(education) %>% summarise(number_of_rich_people = sum(is_rich), number_of_poor_people = n() - number_of_rich_people, total_people = n())
education_summary
## # A tibble: 16 x 4
##    education       number_of_rich_people number_of_poor_people total_people
##    <fct>                           <dbl>                 <dbl>        <int>
##  1 " 10th"                           933                     0          933
##  2 " 11th"                          1175                     0         1175
##  3 " 12th"                           433                     0          433
##  4 " 1st-4th"                        168                     0          168
##  5 " 5th-6th"                        333                     0          333
##  6 " 7th-8th"                        646                     0          646
##  7 " 9th"                            514                     0          514
##  8 " Assoc-acdm"                    1067                     0         1067
##  9 " Assoc-voc"                     1382                     0         1382
## 10 " Bachelors"                     5354                     0         5354
## 11 " Doctorate"                      413                     0          413
## 12 " HS-grad"                      10501                     0        10501
## 13 " Masters"                       1723                     0         1723
## 14 " Preschool"                       51                     0           51
## 15 " Prof-school"                    576                     0          576
## 16 " Some-college"                  7291                     0         7291

Every person is counted as rich, which cannot be right. Printing the distinct values shows why: each one starts with a space, so the comparison with “<=50K” never matches.

print(unique(data$predictive_variable))
## [1]  <=50K  >50K 
## Levels:  <=50K  >50K
print(unique(as.character(data$predictive_variable)))
## [1] " <=50K" " >50K"
library(stringr)  # str_sub(), str_length()
salary <- unique(as.character(data$predictive_variable))
str_sub(salary, 2, str_length(salary))
## [1] "<=50K" ">50K"
gsub(" ", "", salary)
## [1] "<=50K" ">50K"
trimws(salary)
## [1] "<=50K" ">50K"
salary
## [1] " <=50K" " >50K"
library(microbenchmark)

microbenchmark(str_sub(salary, 2, str_length(salary)), gsub(" ", "", salary), trimws(salary))
## Unit: microseconds
##                                    expr   min     lq    mean median     uq
##  str_sub(salary, 2, str_length(salary))   2.9   3.50   4.587   4.30   4.70
##                   gsub(" ", "", salary)   4.7   5.10   6.131   6.10   6.50
##                          trimws(salary) 147.3 148.65 153.570 149.65 151.85
##    max neval cld
##   26.1   100  a 
##   17.7   100  a 
##  229.2   100   b
salary <- str_sub(salary, 2, str_length(salary))
salary
## [1] "<=50K" ">50K"
data$predictive_variable <- str_sub(data$predictive_variable, 2, str_length(data$predictive_variable))
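
The same leading space shows up in other columns too (the education labels print as " 10th", " 11th", and so on). Rather than fixing one column at a time, all text columns can be trimmed in a single pass. A minimal base R sketch follows. The helper name trim_text_columns is hypothetical, it uses trimws() for robustness even though the benchmark above shows it is slower, and it converts factor columns to character as a side effect.

```r
# Trim whitespace from every character or factor column of a data frame
trim_text_columns <- function(df) {
  is_text <- vapply(df, function(col) is.character(col) || is.factor(col), logical(1))
  df[is_text] <- lapply(df[is_text], function(col) trimws(as.character(col)))
  df
}

# Made-up rows with the same kind of leading space
toy <- data.frame(
  education = factor(c(" Bachelors", " HS-grad")),
  income    = c(" >50K", " <=50K"),
  age       = c(28, 38),
  stringsAsFactors = FALSE
)

clean <- trim_text_columns(toy)
clean$income == "<=50K"
```

Unlike dropping the first character with str_sub(), trimws() leaves values untouched when they have no leading space.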

data$is_rich <- if_else(data$predictive_variable == "<=50K", 0, 1)
education_summary <- data %>% group_by(education) %>% summarise(number_of_rich_people = sum(is_rich), number_of_poor_people = n() - number_of_rich_people, total_people = n())
education_summary
## # A tibble: 16 x 4
##    education       number_of_rich_people number_of_poor_people total_people
##    <fct>                           <dbl>                 <dbl>        <int>
##  1 " 10th"                            62                   871          933
##  2 " 11th"                            60                  1115         1175
##  3 " 12th"                            33                   400          433
##  4 " 1st-4th"                          6                   162          168
##  5 " 5th-6th"                         16                   317          333
##  6 " 7th-8th"                         40                   606          646
##  7 " 9th"                             27                   487          514
##  8 " Assoc-acdm"                     265                   802         1067
##  9 " Assoc-voc"                      361                  1021         1382
## 10 " Bachelors"                     2221                  3133         5354
## 11 " Doctorate"                      306                   107          413
## 12 " HS-grad"                       1675                  8826        10501
## 13 " Masters"                        959                   764         1723
## 14 " Preschool"                        0                    51           51
## 15 " Prof-school"                    423                   153          576
## 16 " Some-college"                  1387                  5904         7291
education_data <- distinct(data %>% select(education, education_num)) %>% arrange(education_num)
education_data
##        education education_num
## 1      Preschool             1
## 2        1st-4th             2
## 3        5th-6th             3
## 4        7th-8th             4
## 5            9th             5
## 6           10th             6
## 7           11th             7
## 8           12th             8
## 9        HS-grad             9
## 10  Some-college            10
## 11     Assoc-voc            11
## 12    Assoc-acdm            12
## 13     Bachelors            13
## 14       Masters            14
## 15   Prof-school            15
## 16     Doctorate            16
data$is_rich <- if_else(data$predictive_variable == "<=50K", 0, 1)
education_summary <- data %>%
    group_by(education, education_num) %>%
    summarise(percentage_of_rich_people = sum(is_rich) / n() * 100) %>%
    arrange(education_num)
education_summary
## # A tibble: 16 x 3
## # Groups:   education [16]
##    education       education_num percentage_of_rich_people
##    <fct>                   <int>                     <dbl>
##  1 " Preschool"                1                      0   
##  2 " 1st-4th"                  2                      3.57
##  3 " 5th-6th"                  3                      4.80
##  4 " 7th-8th"                  4                      6.19
##  5 " 9th"                      5                      5.25
##  6 " 10th"                     6                      6.65
##  7 " 11th"                     7                      5.11
##  8 " 12th"                     8                      7.62
##  9 " HS-grad"                  9                     16.0 
## 10 " Some-college"            10                     19.0 
## 11 " Assoc-voc"               11                     26.1 
## 12 " Assoc-acdm"              12                     24.8 
## 13 " Bachelors"               13                     41.5 
## 14 " Masters"                 14                     55.7 
## 15 " Prof-school"             15                     73.4 
## 16 " Doctorate"               16                     74.1
ggplotly(ggplot(education_summary, aes(x = reorder(education, education_num), y = percentage_of_rich_people)) + geom_bar(stat = "identity") + coord_flip())
ggplotly(ggplot(education_summary, aes(x = education_num, y = percentage_of_rich_people, color = education)) + geom_point())
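
To put a number on the trend, one option is a rank correlation between education level and the share of people earning more than 50K. The sketch below copies the 16 percentages from the summary table above. It treats each education level as one point regardless of group size, so it is a rough check of monotonic association and says nothing about causation.

```r
# Values copied from the education_summary table above
education_num <- 1:16
percentage_of_rich_people <- c(0, 3.57, 4.80, 6.19, 5.25, 6.65, 5.11, 7.62,
                               16.0, 19.0, 26.1, 24.8, 41.5, 55.7, 73.4, 74.1)

# Spearman's rho: 1 means the share rises with every step in education level
rho <- cor(education_num, percentage_of_rich_people, method = "spearman")
rho
```

A value close to 1 is consistent with Hypothesis 1. A model fitted on the individual rows would be needed for a proper test.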

Hypothesis 2

People who work for the government are likely to make more money

occupation_summary <- data %>%
    group_by(occupation) %>%
    summarise(percentage_of_rich_people = sum(is_rich) / n() * 100) %>%
    arrange(percentage_of_rich_people)
occupation_summary
## # A tibble: 15 x 2
##    occupation           percentage_of_rich_people
##    <fct>                                    <dbl>
##  1 " Priv-house-serv"                       0.671
##  2 " Other-service"                         4.16 
##  3 " Handlers-cleaners"                     6.28 
##  4 " ?"                                    10.4  
##  5 " Armed-Forces"                         11.1  
##  6 " Farming-fishing"                      11.6  
##  7 " Machine-op-inspct"                    12.5  
##  8 " Adm-clerical"                         13.5  
##  9 " Transport-moving"                     20.0  
## 10 " Craft-repair"                         22.7  
## 11 " Sales"                                26.9  
## 12 " Tech-support"                         30.5  
## 13 " Protective-serv"                      32.5  
## 14 " Prof-specialty"                       44.9  
## 15 " Exec-managerial"                      48.4
ggplotly(ggplot(occupation_summary, aes(x = reorder(occupation, percentage_of_rich_people), y = percentage_of_rich_people)) + geom_bar(stat = "identity") + coord_flip())
# data$government_job <- if_else(data$occupation %in% c())
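
The occupation column does not say who the employer is, so the government flag may be easier to derive from workclass. The sketch below assumes that government employers appear in workclass as values ending in "-gov" (for example "State-gov"). That pattern is an assumption not shown in the output above, and the rows are made up for illustration.

```r
# Made-up rows; the "-gov" suffix convention is an assumption
toy <- data.frame(
  workclass = c(" State-gov", " Private", " Federal-gov",
                " Private", " Local-gov", " Self-emp-inc"),
  is_rich   = c(1, 0, 1, 0, 0, 1)
)

# Flag government jobs after trimming the leading space
toy$government_job <- grepl("-gov$", trimws(toy$workclass))

# Percentage of rich people in each group
government_summary <- aggregate(is_rich ~ government_job, data = toy,
                                FUN = function(x) mean(x) * 100)
government_summary
```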

Hypothesis 3

People who work more make more money
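
One way to examine this is to group hours_per_week into bands and compare the share of rich people in each band. The sketch below uses made-up rows and arbitrary band boundaries, both of which are assumptions for illustration. The same steps would apply to the real data once is_rich has been computed.

```r
# Made-up rows for illustration only
toy <- data.frame(
  hours_per_week = c(13, 40, 40, 16, 45, 60, 38, 55),
  is_rich        = c(0, 0, 1, 0, 1, 1, 0, 0)
)

# Bands are right-closed: (0,20], (20,40], (40,60], (60,Inf]
toy$hours_band <- cut(toy$hours_per_week,
                      breaks = c(0, 20, 40, 60, Inf),
                      labels = c("1-20", "21-40", "41-60", "61+"))

# Percentage of rich people per band (bands with no rows are omitted)
hours_summary <- aggregate(is_rich ~ hours_band, data = toy,
                           FUN = function(x) mean(x) * 100)
hours_summary
```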

Hypothesis 4

Do men make more money than women?
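
A two-way table of gender against income group is a simple starting point for this question. The sketch below uses made-up rows, so the percentages it prints say nothing about the real data. With the actual data frame, the same calls would use data$gender and data$predictive_variable.

```r
# Made-up rows for illustration only
toy <- data.frame(
  gender = c("Male", "Male", "Male", "Female", "Female", "Female", "Male", "Female"),
  income = c(">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K")
)

# Counts per gender and income group
counts <- table(toy$gender, toy$income)

# Row percentages: within each gender, the share in each income group
shares <- prop.table(counts, margin = 1) * 100
round(shares, 1)
```

Comparing row percentages rather than raw counts matters when the two groups differ in size.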

Hypothesis 5

Hypothesis 6